
[SPARK-36526][SQL] DSV2 Index Support: Add supportsIndex interface #33754

Closed
wants to merge 7 commits into from

Conversation

huaxingao (Contributor) commented Aug 16, 2021

What changes were proposed in this pull request?

Indexes are database objects created on one or more columns of a table. They are used to improve query performance. A detailed explanation of database indexes is here: https://en.wikipedia.org/wiki/Database_index

This PR adds a supportsIndex interface that provides APIs to work with indexes.

Why are the changes needed?

Many data sources support indexes to improve query performance. To take advantage of the index support in a data source, this supportsIndex interface is added to let users create/drop an index, list indexes, etc.

Does this PR introduce any user-facing change?

Yes, the following new APIs are added:

  • createIndex
  • dropIndex
  • indexExists
  • listIndexes

New SQL syntax:


CREATE [index_type] INDEX [index_name] ON [TABLE] table_name (column_index_property_list)[OPTIONS indexPropertyList]

    column_index_property_list: column_name [OPTIONS(indexPropertyList)]  [ ,  . . . ]
    indexPropertyList: index_property_name = index_property_value [ ,  . . . ]

DROP INDEX index_name

How was this patch tested?

Only the interface is added for now. Tests will be added when doing the implementation.
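To make the proposed surface concrete, here is a small, self-contained Java sketch that mirrors the four APIs listed above. It deliberately uses local stand-in types instead of the real Spark classes (Identifier, NamedReference, the new exception types), so treat it as an illustration of the contract rather than the interface added by this PR.

```
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for the Index/TableIndex class added by this PR.
final class IndexInfo {
  final String name;
  final String type;                       // free-form index type, e.g. "BTREE"
  final List<String> columns;
  final Map<String, String> properties;

  IndexInfo(String name, String type, List<String> columns, Map<String, String> properties) {
    this.name = name;
    this.type = type;
    this.columns = columns;
    this.properties = properties;
  }
}

// Mirrors the four APIs listed above; a real data source would translate these
// calls into its own CREATE INDEX / DROP INDEX operations.
interface SupportsIndexSketch {
  void createIndex(String name, String type, List<String> columns, Map<String, String> properties);
  boolean dropIndex(String name);
  boolean indexExists(String name);
  List<IndexInfo> listIndexes();
}

// Trivial in-memory implementation, just to show the expected behavior.
final class InMemoryIndexSupport implements SupportsIndexSketch {
  private final Map<String, IndexInfo> indexes = new HashMap<>();

  @Override
  public void createIndex(String name, String type, List<String> columns,
      Map<String, String> properties) {
    if (indexes.containsKey(name)) {
      // The real API throws IndexAlreadyExistsException here.
      throw new IllegalStateException("Index '" + name + "' already exists");
    }
    indexes.put(name, new IndexInfo(name, type, columns, properties));
  }

  @Override
  public boolean dropIndex(String name) {
    return indexes.remove(name) != null;
  }

  @Override
  public boolean indexExists(String name) {
    return indexes.containsKey(name);
  }

  @Override
  public List<IndexInfo> listIndexes() {
    return new ArrayList<>(indexes.values());
  }
}
```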

github-actions bot added the SQL label Aug 16, 2021

SparkQA commented Aug 16, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47019/

Comment on lines 82 to 83
class IndexAlreadyExistsException(indexName: String)
  extends AnalysisException(s"Index '$indexName' already exists")
Member

Add table param?

Contributor Author

Added.

Comment on lines 50 to 55
void createIndex(String indexName,
    String indexType,
    Identifier table,
    FieldReference[] columns,
    Map<String, String> properties)
    throws IndexAlreadyExistsException, UnsupportedOperationException;
Member

Should we use 2-space indent?

Contributor Author

Fixed. Thanks!


public Index(
    String indexName,
    String indexType,
viirya (Member) Aug 16, 2021

What indexType could be? Is it up to data source implementation?

Contributor Author

It's up to the data source implementation.

Contributor

Do we have an example index type in mind? String seems vague to me since it can contain arbitrary stuff. Why don't we go with an enum or a class? Another perspective: why can't we put the index type inside properties?

Contributor Author

Example index types are BLOOM_FILTER_INDEX, Z_ORDERING_INDEX, and BTREE_INDEX. I actually started with an enum and changed to String.
It's more convenient to have the index type outside of properties. In the majority of data sources, the create index syntax is
CREATE [index_type] INDEX index_name ON [TABLE] table_name (column_name [ , . . . ])[OPTIONS indexPropertyList]

@cloud-fan Shall we use an enum instead of String for the index type? Even though the catalog implementation is responsible for recognizing and handling the user-specified index type, wouldn't we still need to add these specific index types in SqlBase? We probably want to use an enum to let users know which index types are supported.

Contributor

Index is for performance only and Spark doesn't need to define the semantics, so I think String is more convenient. A data source can define whatever index types it likes and ask end-users to use them. Spark doesn't need to be in the middle.
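To illustrate that point, a data source might simply map the free-form index type string onto whatever it supports natively; the method, clause strings, and fallback below are purely hypothetical.

```
// Hypothetical sketch of a data source interpreting the free-form index type
// string, using the example types mentioned above (BTREE, BLOOM_FILTER).
final class IndexTypeMapping {
  static String toNativeIndexClause(String indexType) {
    switch (indexType == null ? "" : indexType.toUpperCase()) {
      case "":             // no type given: this toy falls back to BTREE
      case "BTREE":
        return "USING BTREE";
      case "BLOOM_FILTER":
        return "USING BLOOM";
      default:
        // Spark stays out of the middle; unknown types are the source's call.
        throw new UnsupportedOperationException("Unsupported index type: " + indexType);
    }
  }
}
```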

SparkQA commented Aug 16, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47020/

SparkQA commented Aug 16, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47020/

huaxingao (Contributor Author)

cc @cloud-fan @wangmiao1981

SparkQA commented Aug 16, 2021

Test build #142518 has finished for PR 33754 at commit b89b321.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 17, 2021

Test build #142519 has finished for PR 33754 at commit f9f4e37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class IndexAlreadyExistsException(indexName: String, table: Identifier)

HyukjinKwon (Member)

@huaxingao would you mind explaining the idea of index? There is also a concept of index in other components, e.g. the pandas API on Spark, and I think it's best to disambiguate it by being more explicit.

huaxingao (Contributor Author)

@HyukjinKwon Sorry for the confusion. I didn't put enough explanation in the PR's description. I updated the description. Hope it's clear now.

HyukjinKwon (Member)

Oh, okay. So it really means the concept of an index on a DBMS table.

c21 (Contributor) left a comment

Thanks @huaxingao for the work. Wondering do we have any POC implementation/example/ideas to look at for index support? Thanks.



* @throws IndexAlreadyExistsException If the index already exists (optional)
* @throws UnsupportedOperationException If create index is not a supported operation
*/
void createIndex(String indexName,
Contributor

For a partitioned table, do we plan to support index creation at the table level (for all partitions), or at the individual partition level?

Contributor Author

This is up to the data source implementation. I think it makes more sense at the file level (each data file has an index file).

LuciferYang (Contributor) Aug 24, 2021

I prefer to support index creation at the individual partition level.

For existing data in a production environment, if we only support index creation at the table level, it is likely to be an impossible job for users.

Contributor Author

Sorry, I don't think I explained this clearly: the index creation is actually done by the underlying data source. It's up to the data source's implementation at which level the index is created. For an implementation in a file-based data source, I believe the index is created at the file level, not at the table or partition level.

Contributor

Thanks for your explanation.

huaxingao (Contributor Author)

Wondering do we have any POC implementation/example/ideas to look at for index support?

No POC yet. I have some of the APIs implemented using MySQL. We can probably have a POC using Delta Lake or Iceberg in the future.

huaxingao (Contributor Author)

@cloud-fan
Here is the ALTER INDEX syntax from the major DBMSs:

Oracle:
https://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_1008.htm

ALTER INDEX [ schema. ]index
  { { deallocate_unused_clause
    | allocate_extent_clause
    | shrink_clause
    | parallel_clause
    | physical_attributes_clause
    | logging_clause
    }
      [ deallocate_unused_clause
      | allocate_extent_clause
      | shrink_clause
      | parallel_clause
      | physical_attributes_clause
      | logging_clause
      ]...
  | rebuild_clause
  | PARAMETERS ('ODCI_parameters')
  | { ENABLE | DISABLE }
  | UNUSABLE
  | RENAME TO new_name
  | COALESCE
  | { MONITORING | NOMONITORING } USAGE
  | UPDATE BLOCK REFERENCES
  | alter_index_partitioning
  } ;

DB2
https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0054932.html

[Screenshot: DB2 ALTER INDEX syntax diagram]

MS SQL server:
https://docs.microsoft.com/en-us/sql/t-sql/statements/alter-index-transact-sql?view=sql-server-ver15

-- Syntax for SQL Server and Azure SQL Database
  
ALTER INDEX { index_name | ALL } ON <object>  
{  
      REBUILD {  
            [ PARTITION = ALL ] [ WITH ( <rebuild_index_option> [ ,...n ] ) ]   
          | [ PARTITION = partition_number [ WITH ( <single_partition_rebuild_index_option> ) [ ,...n ] ]  
      }  
    | DISABLE  
    | REORGANIZE  [ PARTITION = partition_number ] [ WITH ( <reorganize_option>  ) ]  
    | SET ( <set_index_option> [ ,...n ] )   
    | RESUME [WITH (<resumable_index_options>,[...n])]
    | PAUSE
    | ABORT
}  
[ ; ]  

Postgres
https://www.postgresql.org/docs/13/sql-alterindex.html

ALTER INDEX [ IF EXISTS ] name RENAME TO new_name
ALTER INDEX [ IF EXISTS ] name SET TABLESPACE tablespace_name
ALTER INDEX name ATTACH PARTITION index_name
ALTER INDEX name [ NO ] DEPENDS ON EXTENSION extension_name
ALTER INDEX [ IF EXISTS ] name SET ( storage_parameter [= value] [, ... ] )
ALTER INDEX [ IF EXISTS ] name RESET ( storage_parameter [, ... ] )
ALTER INDEX [ IF EXISTS ] name ALTER [ COLUMN ] column_number
    SET STATISTICS integer
ALTER INDEX ALL IN TABLESPACE name [ OWNED BY role_name [, ... ] ]
    SET TABLESPACE new_tablespace [ NOWAIT ]

No ALTER INDEX in MySQL

The ALTER INDEX syntaxes are very different, but for a multi-column index created by
CREATE INDEX index_name ON table_name (column1, column2, column3, ...), it seems that we can't alter the properties of specific columns.

SparkQA commented Sep 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47409/

SparkQA commented Sep 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47409/

SparkQA commented Sep 1, 2021

Test build #142906 has finished for PR 33754 at commit d4c1931.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @since 3.3.0
*/
@Evolving
public class Index {
Contributor

public final class

Contributor

shall we name it TableIndex to be more specific?

* Returns the string map of index properties.
*/
Map<String, String> properties() {
return Collections.emptyMap();
Contributor

this should return properties

Map<String, String> properties() {
return Collections.emptyMap();
}
}
Contributor

shall we also add Map<String, String> columnProperties(col: FieldReference)?

throws IndexAlreadyExistsException, UnsupportedOperationException;

/**
* Soft deletes the index with the given name.
Contributor

what's "soft delete"?

Index[] listIndexes(Identifier table) throws NoSuchTableException;

/**
* Hard deletes the index with the given name.
Contributor

what's the corresponding SQL syntax for delete and drop index?

* @throws NoSuchIndexException If the index does not exist (optional)
* @throws UnsupportedOperationException
*/
default boolean alterIndex(String indexName, Properties properties)
Contributor

I think we should follow TableCatalog.alterTable

alterIndex(String indexName, IndexChange[] changes)

It's not very clear to me what can be changed in ALTER INDEX though.
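For illustration only, an IndexChange-style API in the spirit of this suggestion might look like the sketch below, modeled on TableCatalog.alterTable/TableChange. Neither IndexChange nor this alterIndex shape is part of the PR as merged; all names here are hypothetical.

```
// Hypothetical IndexChange, mirroring the TableChange pattern. Usage might be:
//   catalog.alterIndex("people_idx",
//       new IndexChange[]{ IndexChange.setProperty("fillfactor", "70") });
public interface IndexChange {

  static IndexChange setProperty(String property, String value) {
    return new SetIndexProperty(property, value);
  }

  final class SetIndexProperty implements IndexChange {
    private final String property;
    private final String value;

    private SetIndexProperty(String property, String value) {
      this.property = property;
      this.value = value;
    }

    public String property() { return property; }
    public String value() { return value; }
  }
}
```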

SparkQA commented Sep 8, 2021

Test build #143092 has finished for PR 33754 at commit 14a819a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class TableIndex

SparkQA commented Sep 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47594/

SparkQA commented Sep 8, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47594/

* @since 3.3.0
*/
@Evolving
public interface SupportsIndex extends CatalogPlugin {
Contributor

shall we follow SupportsPartitionManagement and make it extend Table?

Contributor

nvm, index has a unique name. DROP INDEX does not need a table.

Contributor

Actually, not all databases make index name globally unique, see https://www.w3schools.com/sql/sql_ref_drop_index.asp

I think we can still make SupportsIndex extend Table, if the SQL syntax is DROP INDEX index_name ON table_name;

Contributor

After thinking more about it, I think DROP INDEX index_name ON [TABLE] table_name is better, as it's more consistent with the CREATE INDEX syntax.

This is also more flexible: the index name only needs to be unique within the table.

import org.apache.spark.sql.connector.expressions.NamedReference;

/**
* Catalog methods for working with index
Contributor

Can we refine the classdoc?

Contributor Author

Fixed. Thanks

NamedReference[] columns,
Map<NamedReference, Properties>[] columnProperties,
Properties properties)
throws IndexAlreadyExistsException, UnsupportedOperationException;
Contributor

UnsupportedOperationException is not a checked Java exception; we don't need to put it in the throws clause.

Contributor Author

removed

* @param indexName the name of the index to be dropped.
* @return true if the index is dropped
* @throws NoSuchIndexException If the index does not exist (optional)
* @throws UnsupportedOperationException If drop index is not a supported operation
Contributor

ditto

boolean dropIndex(String indexName) throws NoSuchIndexException, UnsupportedOperationException;

/**
* Checks whether an index exists.
Contributor

Suggested change
* Checks whether an index exists.
* Checks whether an index exists in this table.

Contributor Author

Fixed

return properties;
}

Properties columnProperties(NamedReference column) { return columnProperties.get(column); }
Contributor

Do we need this API? People can just get all the column properties as a map and do whatever they want.

Contributor Author

Removed. Thanks!

SparkQA commented Sep 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48195/

SparkQA commented Sep 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48195/

SparkQA commented Sep 28, 2021

Test build #143681 has finished for PR 33754 at commit 120b477.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Sep 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48200/

SparkQA commented Sep 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48200/

SparkQA commented Sep 29, 2021

Test build #143685 has finished for PR 33754 at commit f6ca2e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan (Contributor)

thanks, merging to master!

cloud-fan closed this in d2bb359 Sep 29, 2021

dongjoon-hyun (Member) commented Sep 29, 2021

Thank you so much, @huaxingao , @cloud-fan , @viirya , @HyukjinKwon , @c21 , @LuciferYang !

huaxingao (Contributor Author)

Thank you very much, everyone!

huaxingao deleted the index_interface branch September 29, 2021 03:51
/**
* @return the column(s) this Index is on. Could be multi columns (a multi-column index).
*/
NamedReference[] columns() { return columns; }
Member

This is actually what pandas API on Spark implemented ... probably we should migrate to DSv2 eventually in the very far future ... cc @xinrong-databricks @ueshin @itholic FYI

chenzhx pushed a commit to chenzhx/spark that referenced this pull request Feb 22, 2022
chenzhx pushed a commit to chenzhx/spark that referenced this pull request Mar 30, 2022
chenzhx pushed a commit to Kyligence/spark that referenced this pull request Apr 18, 2022
chenzhx added a commit to Kyligence/spark that referenced this pull request May 5, 2022
* [SPARK-36556][SQL] Add DSV2 filters

Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>

### What changes were proposed in this pull request?
Add DSV2 Filters and use these in V2 codepath.

### Why are the changes needed?
The motivation of adding DSV2 filters:
1. The values in V1 filters are Scala types. When translating a Catalyst `Expression` to a V1 filter, we have to call `convertToScala` to convert from the Catalyst types used internally in rows to standard Scala types, and later convert the Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversions from Catalyst types to Scala types and back are avoided (see the toy sketch after this list).
2. Improve nested column filter support.
3. Make the filters work better with the rest of the DSV2 APIs.
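As an illustration only (not from this commit), here is a toy contrast of point 1 using local stand-in classes rather than the real Spark filter types, just to show what each representation carries.

```
// Toy stand-ins only; not the real org.apache.spark.sql.sources.EqualTo or the
// V2 predicate classes.
final class V1EqualToSketch {
  final String column;
  final Object externalValue;   // external Scala/Java value (e.g. java.sql.Timestamp);
                                // Catalyst must convert to/from its internal form.
  V1EqualToSketch(String column, Object externalValue) {
    this.column = column;
    this.externalValue = externalValue;
  }
}

final class V2EqualToSketch {
  final String column;
  final Object internalValue;   // Catalyst internal form (e.g. Long microseconds for a timestamp)
  final String dataType;        // the data type travels with the value, so no conversion is needed
  V2EqualToSketch(String column, Object internalValue, String dataType) {
    this.column = column;
    this.internalValue = internalValue;
    this.dataType = dataType;
  }
}
```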

### Does this PR introduce _any_ user-facing change?
Yes. The new V2 filters

### How was this patch tested?
new test

Closes #33803 from huaxingao/filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-36760][SQL] Add interface SupportsPushDownV2Filters

Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
This is the 2nd PR for V2 Filter support. This PR does the following:

- Add interface SupportsPushDownV2Filters

Future work:
- refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them
- For V2 file source: implement  v2 filter -> parquet/orc filter. csv and Json don't have real filters, but also need to change the current code to have v2 filter -> `JacksonParser`/`UnivocityParser`
- For V1 file source, keep what we currently have: v1 filter -> parquet/orc filter
- We don't need v1filter.toV2 and v2filter.toV1 since we have two separate paths

The reasons that we have reached the above conclusion:
- The major motivation to implement V2Filter is to eliminate the unnecessary conversion between Catalyst types and Scala types when using Filters.
- We provide this `SupportsPushDownV2Filters` in this PR so V2 data source (e.g. iceberg) can implement it and use V2 Filters
- There are lots of work to implement v2 filters in the V2 file sources because of the following reasons:

possible approaches for implementing V2Filter:
1. keep what we have for file source v1: v1 filter -> parquet/orc filter
    file source v2 we will implement v2 filter -> parquet/orc filter
    We don't need v1->v2 and v2->v1
    problem with this approach: there are lots of code duplication

2.  We will implement v2 filter -> parquet/orc filter
     file source v1: v1 filter -> v2 filter -> parquet/orc filter
     We will need V1 -> V2
     This is the approach I am using in https://github.com/apache/spark/pull/33973
     In that PR, I have
     v2 orc: v2 filter -> orc filter
     V1 orc: v1 -> v2 -> orc filter

     v2 csv: v2->v1, new UnivocityParser
     v1 csv: new UnivocityParser

    v2 Json: v2->v1, new JacksonParser
    v1 Json: new JacksonParser

    csv and Json don't have real filters, they just use filter references, so it should be OK to use either v1 or v2. It's easier to use
    v1 because there's no need to change.

    I haven't finished parquet yet. The PR doesn't have the parquet V2Filter implementation, but I plan to have
    v2 parquet: v2 filter -> parquet filter
    v1 parquet: v1 -> v2 -> parquet filter

    Problem with this approach:
    1. It's not easy to implement V1->V2 because V2 filters have `LiteralValue` and need type info. We already lost the type information when we converted the Expression filter to a v1 filter.
    2. parquet is OK
        Use Timestamp as example, parquet filter takes long for timestamp
        v2 parquet: v2 filter -> parquet filter
       timestamp
       Expression (Long) -> v2 filter (LiteralValue  Long)-> parquet filter (Long)

       V1 parquet: v1 -> v2 -> parquet filter
       timestamp
       Expression (Long) -> v1 filter (timestamp) -> v2 filter (LiteralValue  Long)-> parquet filter (Long)

       but we have problem for orc because orc filter takes java Timestamp
       v2 orc: v2 filter -> orc filter
       timestamp
       Expression (Long) -> v2 filter (LiteralValue  Long)->  parquet filter (Timestamp)

       V1 orc: v1 -> v2 -> orc filter
       Expression (Long) ->  v1 filter (timestamp) -> v2 filter (LiteralValue  Long)-> parquet filter (Timestamp)
      This defeats the purpose of implementing v2 filters.
3.  keep what we have for file source v1: v1 filter -> parquet/orc filter
     file source v2: v2 filter -> v1 filter -> parquet/orc filter
     We will need V2 -> V1
     we have similar problem as approach 2.

So the conclusion is: approach 1 (keep what we have for file source v1: v1 filter -> parquet/orc filter
    file source v2 we will implement v2 filter -> parquet/orc filter) is better, but there are lots of code duplication. We will need to refactor `OrcFilters`, `ParquetFilters`, `JacksonParser`, `UnivocityParser` so both V1 file source and V2 file source can use them.

### Why are the changes needed?
Use V2Filters to eliminate the unnecessary conversion between Catalyst types and Scala types.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Added new UT

Closes #34001 from huaxingao/v2filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37020][SQL] DS V2 LIMIT push down

### What changes were proposed in this pull request?
Push down limit to data source for better performance

### Why are the changes needed?
For LIMIT, e.g. `SELECT * FROM table LIMIT 10`, Spark retrieves all the data from the table and then returns 10 rows. If we can push LIMIT down to the data source side, the data transferred to Spark will be dramatically reduced.

### Does this PR introduce _any_ user-facing change?
Yes. new interface `SupportsPushDownLimit`
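As an illustration only (not from this commit), a minimal sketch of a ScanBuilder honoring the new interface, assuming `SupportsPushDownLimit` exposes a single `boolean pushLimit(int)` method; `LimitedScan` is a hypothetical stub.

```
import org.apache.spark.sql.connector.read.Scan;
import org.apache.spark.sql.connector.read.ScanBuilder;
import org.apache.spark.sql.connector.read.SupportsPushDownLimit;
import org.apache.spark.sql.types.StructType;

class LimitAwareScanBuilder implements ScanBuilder, SupportsPushDownLimit {
  private int pushedLimit = -1;   // -1 means no limit has been pushed

  @Override
  public boolean pushLimit(int limit) {
    // Remember the limit so the source reads at most `limit` rows;
    // returning true tells Spark the source will honor it.
    this.pushedLimit = limit;
    return true;
  }

  @Override
  public Scan build() {
    return new LimitedScan(pushedLimit);
  }

  // Stub Scan; a real implementation would plan input partitions that stop
  // after `limit` rows.
  static final class LimitedScan implements Scan {
    private final int limit;
    LimitedScan(int limit) { this.limit = limit; }
    @Override
    public StructType readSchema() { return new StructType(); }
  }
}
```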

### How was this patch tested?
new test

Closes #34291 from huaxingao/pushdownLimit.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>

* [SPARK-37038][SQL] DSV2 Sample Push Down

### What changes were proposed in this pull request?

Push down Sample to data source for better performance. If Sample is pushed down, it will be removed from logical plan so it will not be applied at Spark any more.

Current Plan without Sample push down:
```
== Parsed Logical Plan ==
'Project [*]
+- 'Sample 0.0, 0.8, false, 157
   +- 'UnresolvedRelation [postgresql, new_table], [], false

== Analyzed Logical Plan ==
col1: int, col2: int
Project [col1#163, col2#164]
+- Sample 0.0, 0.8, false, 157
   +- SubqueryAlias postgresql.new_table
      +- RelationV2[col1#163, col2#164] new_table

== Optimized Logical Plan ==
Sample 0.0, 0.8, false, 157
+- RelationV2[col1#163, col2#164] new_table

== Physical Plan ==
*(1) Sample 0.0, 0.8, false, 157
+- *(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$16dde4769 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [],  ReadSchema: struct<col1:int,col2:int>
```
after Sample push down:
```
== Parsed Logical Plan ==
'Project [*]
+- 'Sample 0.0, 0.8, false, 187
   +- 'UnresolvedRelation [postgresql, new_table], [], false

== Analyzed Logical Plan ==
col1: int, col2: int
Project [col1#163, col2#164]
+- Sample 0.0, 0.8, false, 187
   +- SubqueryAlias postgresql.new_table
      +- RelationV2[col1#163, col2#164] new_table

== Optimized Logical Plan ==
RelationV2[col1#163, col2#164] new_table

== Physical Plan ==
*(1) Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$165b57543 [col1#163,col2#164] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], PushedSample: TABLESAMPLE  0.0 0.8 false 187, ReadSchema: struct<col1:int,col2:int>
```
The new interface is implemented using JDBC for POC and end to end test. TABLESAMPLE is not supported by all the databases. It is implemented using postgresql in this PR.

### Why are the changes needed?
Reduce IO and improve performance. For SAMPLE, e.g. `SELECT * FROM t TABLESAMPLE (1 PERCENT)`, Spark retrieves all the data from the table and then returns 1% of the rows. It will dramatically reduce the transferred data size and improve performance if we can push Sample down to the data source side.

### Does this PR introduce any user-facing change?
Yes. new interface `SupportsPushDownTableSample`

### How was this patch tested?
New test

Closes #34451 from huaxingao/sample.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37286][SQL] Move compileAggregates from JDBCRDD to JdbcDialect

### What changes were proposed in this pull request?
Currently, the method `compileAggregates` is a member of `JDBCRDD`. But that is not reasonable, because the JDBC source knows well how to compile aggregate expressions into its own dialect.

### Why are the changes needed?
The JDBC source knows well how to compile aggregate expressions into its own dialect.
After this PR, we can extend the pushdown (e.g. aggregate) based on the different dialects of different JDBC databases.

There are two situations:
First, database A and B implement a different number of aggregate functions that meet the SQL standard.

### Does this PR introduce _any_ user-facing change?
'No'. Just change the inner implementation.

### How was this patch tested?
Jenkins tests.

Closes #34554 from beliefer/SPARK-37286.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37286][DOCS][FOLLOWUP] Fix the wrong parameter name for Javadoc

### What changes were proposed in this pull request?

This PR fixes an issue that the Javadoc generation fails due to the wrong parameter name of a method added in SPARK-37286 (#34554).
https://github.com/apache/spark/runs/4409267346?check_suite_focus=true#step:9:5081

### Why are the changes needed?

To keep the build clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA itself.

Closes #34801 from sarutak/followup-SPARK-37286.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

* [SPARK-37262][SQL] Don't log empty aggregate and group by in JDBCScan

### What changes were proposed in this pull request?
Currently, the empty pushed aggregate and pushed group by are logged in Explain for JDBCScan
```
Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$172e75786 [NAME#1,SALARY#2] PushedAggregates: [], PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)], PushedGroupby: [], ReadSchema: struct<NAME:string,SALARY:decimal(20,2)>
```

After the fix, the JDBCSScan will be
```
Scan org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCScan$$anon$172e75786 [NAME#1,SALARY#2] PushedFilters: [IsNotNull(SALARY), GreaterThan(SALARY,100.00)], ReadSchema: struct<NAME:string,SALARY:decimal(20,2)>
```

### Why are the changes needed?
address this comment https://github.com/apache/spark/pull/34451#discussion_r740220800

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

Closes #34540 from huaxingao/aggExplain.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37483][SQL] Support push down top N to JDBC data source V2

### What changes were proposed in this pull request?
Currently, Spark supports pushing down limit to the data source.
However, in users' scenarios, limit usually comes with the premise of order by, because limit and order by are more valuable together.

On the other hand, pushing down top N (same as order by ... limit N) outputs the data to Spark's sort in basic order, so the Spark sort may see some performance improvement.

### Why are the changes needed?
1. Pushing down top N is very useful for users' scenarios.
2. Pushing down top N could improve the performance of the sort.

### Does this PR introduce _any_ user-facing change?
'No'. Just change the physical execute.

### How was this patch tested?
New tests.

Closes #34918 from beliefer/SPARK-37483.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37644][SQL] Support datasource v2 complete aggregate pushdown

### What changes were proposed in this pull request?
Currently, Spark supports pushing down aggregates with partial-agg and final-agg. For some data sources (e.g. JDBC), we can avoid partial-agg and final-agg by running the aggregate completely on the database.

### Why are the changes needed?
Improve performance for aggregate pushdown.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the inner implementation.

### How was this patch tested?
New tests.

Closes #34904 from beliefer/SPARK-37644.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37627][SQL] Add sorted column in BucketTransform

### What changes were proposed in this pull request?
In V1, we can create table with sorted bucket like the following:
```
      sql("CREATE TABLE tbl(a INT, b INT) USING parquet " +
        "CLUSTERED BY (a) SORTED BY (b) INTO 5 BUCKETS")
```
However, creating table with sorted bucket in V2 failed with Exception
`org.apache.spark.sql.AnalysisException: Cannot convert bucketing with sort columns to a transform.`

### Why are the changes needed?
This PR adds sorted column in BucketTransform so we can create table in V2 with sorted bucket

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new UT

Closes #34879 from huaxingao/sortedBucket.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37789][SQL] Add a class to represent general aggregate functions in DS V2

### What changes were proposed in this pull request?

There are a lot of aggregate functions in SQL and it's a lot of work to add them one by one in the DS v2 API. This PR proposes to add a new `GeneralAggregateFunc` class to represent all the general SQL aggregate functions. Since it's general, Spark doesn't know its aggregation buffer and can only push down the aggregation to the source completely.

As an example, this PR also translates `AVG` to `GeneralAggregateFunc` and pushes it to JDBC V2.

### Why are the changes needed?

To add aggregate functions in DS v2 easier.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

JDBC v2 test

Closes #35070 from cloud-fan/agg.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37644][SQL][FOLLOWUP] When partition column is same as group by key, pushing down aggregate completely

### What changes were proposed in this pull request?
When the JDBC option "partitionColumn" is specified and it's the same as the group-by key, the aggregate push-down should be complete.

### Why are the changes needed?
Improve the datasource v2 complete aggregate pushdown.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the inner implementation.

### How was this patch tested?
New tests.

Closes #35052 from beliefer/SPARK-37644-followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL] Translate more standard aggregate functions for pushdown

### What changes were proposed in this pull request?
Currently, Spark aggregate pushdown will translate some standard aggregate functions, so that these functions can be compiled to suit a specific database.
After this job, users can override `JdbcDialect.compileAggregate` to implement standard aggregate functions supported by a given database.
This PR just translates the ANSI standard aggregate functions. Mainstream databases' support for these functions is shown below:
| Name | ClickHouse | Presto | Teradata | Snowflake | Oracle | Postgresql | Vertica | MySQL | RedShift | ElasticSearch | Impala | Druid | SyBase | DB2 | H2 | Exasol | Mariadb | Phoenix | Yellowbrick | Singlestore | Influxdata | Dolphindb | Intersystems |
|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| `VAR_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| `VAR_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |  Yes | Yes | Yes | No | Yes | Yes | No | Yes | Yes |
| `STDDEV_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `STDDEV_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No |  Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes |
| `COVAR_POP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | Yes | Yes | No |
| `COVAR_SAMP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | No | No | No |
| `CORR` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | Yes | No |  Yes | Yes | No | No | No | No | No | Yes | No |

Because some aggregate functions will be converted by the Optimizer as shown below, this PR doesn't need to match them.

|Input|Parsed|Optimized|
|------|--------------------|----------|
|`Every`| `aggregate.BoolAnd` |`Min`|
|`Any`| `aggregate.BoolOr` |`Max`|
|`Some`| `aggregate.BoolOr` |`Max`|

### Why are the changes needed?
Make the implement of `*Dialect` could extends the aggregate functions by override `JdbcDialect.compileAggregate`.

### Does this PR introduce _any_ user-facing change?
Yes. Users could pushdown more aggregate functions.

### How was this patch tested?
Exists tests.

Closes #35101 from beliefer/SPARK-37527-new2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>

* [SPARK-37734][SQL][TESTS] Upgrade h2 from 1.4.195 to 2.0.204

### What changes were proposed in this pull request?
This PR aims to upgrade `com.h2database` from 1.4.195 to 2.0.202

### Why are the changes needed?
Fix one vulnerability, ref: https://www.tenable.com/cve/CVE-2021-23463

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #35013 from beliefer/SPARK-37734.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37527][SQL] Compile `COVAR_POP`, `COVAR_SAMP` and `CORR` in `H2Dialect`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35101 translates `COVAR_POP`, `COVAR_SAMP` and `CORR`, but the older H2 version could not support them.

After https://github.com/apache/spark/pull/35013, we can compile the three aggregate functions in `H2Dialect` now.

### Why are the changes needed?
Supplement the implementation of `H2Dialect`.

### Does this PR introduce _any_ user-facing change?
'Yes'. Spark can completely push down `COVAR_POP`, `COVAR_SAMP` and `CORR` into H2.

### How was this patch tested?
Test updated.

Closes #35145 from beliefer/SPARK-37527_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37839][SQL] DS V2 supports partial aggregate push-down `AVG`

### What changes were proposed in this pull request?
`max`,`min`,`count`,`sum`,`avg` are the most commonly used aggregation functions.
Currently, DS V2 supports complete aggregate push-down of `avg`, but supporting partial aggregate push-down of `avg` is very useful.

The aggregate push-down algorithm is:

1. Spark translates group expressions of `Aggregate` to DS V2 `Aggregation`.
2. Spark calls `supportCompletePushDown` to check if it can completely push down aggregate.
3. If `supportCompletePushDown` returns true, we preserve the aggregate expressions as the final aggregate expressions. Otherwise, we split `AVG` into 2 functions: `SUM` and `COUNT` (see the toy example after this list).
4. Spark translates final aggregate expressions and group expressions of `Aggregate` to DS V2 `Aggregation` again, and pushes the `Aggregation` to JDBC source.
5. Spark constructs the final aggregate.
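As an illustration only (not from this commit), a tiny self-contained example of step 3, showing how partial SUM/COUNT results coming back from the source recombine into the final AVG (the numbers are made up).

```
public class PartialAvgSketch {
  public static void main(String[] args) {
    // Pretend the source returned partial SUM and COUNT per split instead of AVG.
    long[] partialSums = {300L, 700L};     // made-up values
    long[] partialCounts = {3L, 7L};

    long totalSum = 0L;
    long totalCount = 0L;
    for (int i = 0; i < partialSums.length; i++) {
      totalSum += partialSums[i];
      totalCount += partialCounts[i];
    }

    // Spark assembles the final aggregate: AVG = SUM / COUNT = 1000 / 10 = 100.0
    double avg = (double) totalSum / totalCount;
    System.out.println("avg = " + avg);
  }
}
```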

### Why are the changes needed?
DS V2 supports partial aggregate push-down `AVG`

### Does this PR introduce _any_ user-facing change?
'Yes'. DS V2 can partially push down `AVG`.

### How was this patch tested?
New tests.

Closes #35130 from beliefer/SPARK-37839.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36526][SQL] DSV2 Index Support: Add supportsIndex interface

Closes #33754 from huaxingao/index_interface.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36913][SQL] Implement createIndex and IndexExists in DS V2 JDBC (MySQL dialect)

### What changes were proposed in this pull request?
Implementing `createIndex`/`IndexExists` in DS V2 JDBC

### Why are the changes needed?
This is a subtask of the V2 Index support. I am implementing index support for DS V2 JDBC so we can have a POC and an end to end testing. This PR implements `createIndex` and `IndexExists`. Next PR will implement `listIndexes` and `dropIndex`. I intentionally make the PR small so it's easier to review.

Index is not supported by h2 database and create/drop index are not standard SQL syntax. This PR only implements `createIndex` and `IndexExists` in `MySQL` dialect.

### Does this PR introduce _any_ user-facing change?
Yes, `createIndex`/`IndexExist` in DS V2 JDBC

### How was this patch tested?
new test

Closes #34164 from huaxingao/createIndexJDBC.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-36914][SQL] Implement dropIndex and listIndexes in JDBC (MySQL dialect)

### What changes were proposed in this pull request?
This PR implements `dropIndex` and `listIndexes` in MySQL dialect

### Why are the changes needed?
As a subtask of the V2 Index support, this PR completes the implementation for JDBC V2 index support.

### Does this PR introduce _any_ user-facing change?
Yes, `dropIndex/listIndexes` in DS V2 JDBC

### How was this patch tested?
new tests

Closes #34236 from huaxingao/listIndexJDBC.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37343][SQL] Implement createIndex, IndexExists and dropIndex in JDBC (Postgres dialect)

### What changes were proposed in this pull request?
Implementing `createIndex`/`IndexExists`/`dropIndex` in DS V2 JDBC for Postgres dialect.

### Why are the changes needed?
This is a subtask of the V2 Index support. This PR implements `createIndex`, `IndexExists` and `dropIndex`. After review of some changes in this PR, I will create a new PR for `listIndexes`, or add it in this PR.

This PR only implements `createIndex`, `IndexExists` and `dropIndex` in Postgres dialect.

### Does this PR introduce _any_ user-facing change?
Yes, `createIndex`/`IndexExists`/`dropIndex` in DS V2 JDBC

### How was this patch tested?
New test.

Closes #34673 from dchvn/Dsv2_index_postgres.

Authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37867][SQL] Compile aggregate functions of built-in JDBC dialects

### What changes were proposed in this pull request?
DS V2 translates a lot of standard aggregate functions.
Currently, only H2Dialect compiles these standard aggregate functions. This PR compiles these standard aggregate functions for the other built-in JDBC dialects.

### Why are the changes needed?
Make the built-in JDBC dialects support complete aggregate push-down.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can use complete aggregate push-down with the built-in JDBC dialects.

### How was this patch tested?
New tests.

Closes #35166 from beliefer/SPARK-37867.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37929][SQL][FOLLOWUP] Support cascade mode for JDBC V2

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35246 added support for `cascade` mode in the dropNamespace API.
This PR follows up https://github.com/apache/spark/pull/35246 to make JDBC V2 respect `cascade`.

### Why are the changes needed?
Let JDBC V2 respect `cascade`.

### Does this PR introduce _any_ user-facing change?
Yes.
Users could manipulate `drop namespace` with `cascade` on JDBC V2.

### How was this patch tested?
New tests.

Closes #35271 from beliefer/SPARK-37929-followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38035][SQL] Add docker tests for built-in JDBC dialects

### What changes were proposed in this pull request?
Currently, Spark only has `PostgresNamespaceSuite` to test the DS V2 namespace in a docker environment,
but is missing tests for the other built-in JDBC dialects (e.g. Oracle, MySQL).

This PR also found some compatibility issues. For example, the JDBC API `conn.getMetaData.getSchemas` works badly for MySQL.

### Why are the changes needed?
We need to add tests for the other built-in JDBC dialects.

### Does this PR introduce _any_ user-facing change?
'No'. Just add tests which face developers.

### How was this patch tested?
New tests.

Closes #35333 from beliefer/SPARK-38035.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38054][SQL] Supports list namespaces in JDBC v2 MySQL dialect

### What changes were proposed in this pull request?
Currently, `JDBCTableCatalog.scala` queries namespaces as shown below.
```
      val schemaBuilder = ArrayBuilder.make[Array[String]]
      val rs = conn.getMetaData.getSchemas()
      while (rs.next()) {
        schemaBuilder += Array(rs.getString(1))
      }
      schemaBuilder.result
```

But the code cannot get any information when using the MySQL JDBC driver.
This PR uses `SHOW SCHEMAS` to query the namespaces of MySQL.
This PR also fixes other issues listed below:

- Release the docker tests in `MySQLNamespaceSuite.scala`.
- Because MySQL doesn't support creating a comment on a schema, throw `SQLFeatureNotSupportedException`.
- Because MySQL doesn't support `DROP SCHEMA` in `RESTRICT` mode, throw `SQLFeatureNotSupportedException`.
- Refactor `JdbcUtils.executeQuery` to avoid `java.sql.SQLException: Operation not allowed after ResultSet closed`.

### Why are the changes needed?
Make the MySQL dialect support querying namespaces.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Some API changed.

### How was this patch tested?
New tests.

Closes #35355 from beliefer/SPARK-38054.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36351][SQL] Refactor filter push down in file source v2

### What changes were proposed in this pull request?

Currently in `V2ScanRelationPushDown`, we push the filters (partition filters + data filters) to file source, and then pass all the filters (partition filters + data filters) as post scan filters to v2 Scan, and later in `PruneFileSourcePartitions`, we separate partition filters and data filters, set them in the format of `Expression` to file source.

Changes in this PR:
When we push filters to file sources in `V2ScanRelationPushDown`, since we already have the information about the partition columns, we want to separate partition filters and data filters there.

The benefit of doing this:
- we can handle all the filter related work for v2 file source at one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code will be cleaner and easier to maintain.
- we actually have to separate partition filters and data filters at `V2ScanRelationPushDown`, otherwise, there is no way to find out which filters are partition filters, and we can't push down aggregate for parquet even if we only have partition filter.
- By separating the filters early at `V2ScanRelationPushDown`, we only needs to check data filters to find out which one needs to be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to file source, right now we are checking all the filters (both partition filters and data filters)
- Similarly, we can only pass data filters as post scan filters to v2 Scan, because partition filters are used for partition pruning only, no need to pass them as post scan filters.

In order to do this, we will have the following changes

-  add `pushFilters` in file source v2. In this method:
    - push both Expression partition filter and Expression data filter to file source. Have to use Expression filters because we need these for partition pruning.
    - data filters are used for filter push down. If the file source needs to push down data filters, it translates the data filters from `Expression` to `sources.Filter`, and then decides which filters to push down.
    - partition filters are used for partition pruning.
- file source v2 no need to implement `SupportsPushdownFilters` any more, because when we separating the two types of filters, we have already set them on file data sources. It's redundant to use `SupportsPushdownFilters` to set the filters again on file data sources.

### Why are the changes needed?

see section one

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33650 from huaxingao/partition_filter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-36645][SQL] Aggregate (Min/Max/Count) push down for Parquet

### What changes were proposed in this pull request?
Push down Min/Max/Count to Parquet with the following restrictions:

- nested types such as Array, Map or Struct will not be pushed down
- Timestamp is not pushed down because the INT96 sort order is undefined and Parquet doesn't return statistics for INT96
- If the aggregate is on a partition column, only Count will be pushed down; Min and Max will not, because Parquet doesn't return min/max for partition columns.
- If somehow the file doesn't have stats for the aggregate columns, Spark will throw an exception.
- Currently, if a filter/GROUP BY is involved, Min/Max/Count will not be pushed down, but this restriction will be lifted when the filter or GROUP BY is on a partition column (https://issues.apache.org/jira/browse/SPARK-36646 and https://issues.apache.org/jira/browse/SPARK-36647)

### Why are the changes needed?
Since Parquet keeps statistics for min, max and count, we want to take advantage of this information and push Min/Max/Count down to the Parquet layer for better performance.

### Does this PR introduce _any_ user-facing change?
Yes, `SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED` was added. If set to true, we will push down Min/Max/Count to Parquet.
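A hedged usage sketch, assuming the SQL config key backing `SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED` is `spark.sql.parquet.aggregatePushdown`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
// Assumed config key; disabled by default.
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")

// Write some Parquet data to aggregate over.
spark.range(0, 1000).toDF("id").write.mode("overwrite").parquet("/tmp/agg_demo")

// MIN/MAX/COUNT with no filter or GROUP BY can be answered from footer statistics.
spark.read.parquet("/tmp/agg_demo")
  .selectExpr("min(id)", "max(id)", "count(id)")
  .show()
```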

### How was this patch tested?
new test suites

Closes #33639 from huaxingao/parquet_agg.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-34960][SQL] Aggregate push down for ORC

### What changes were proposed in this pull request?

This PR is to add aggregate push down feature for ORC data source v2 reader.

At a high level, the PR does:

* The supported aggregate expression is MIN/MAX/COUNT same as [Parquet aggregate push down](https://github.com/apache/spark/pull/33639).
* BooleanType, ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DateType are allowed in MIN/MAX aggregate push down. All other column types are not allowed in MIN/MAX aggregate push down.
* All column types are supported in COUNT aggregate push down.
* Nested columns' sub-fields are disallowed in aggregate push down.
* If the file does not have valid statistics, Spark will throw an exception and fail the query.
* If the aggregate has a filter or group-by column, the aggregate will not be pushed down.

At code level, the PR does:
* `OrcScanBuilder`: `pushAggregation()` checks whether the aggregation can be pushed down. Most of the checking logic is shared between Parquet and ORC and is extracted into `AggregatePushDownUtils.getSchemaForPushedAggregation()`. `OrcScanBuilder` will create an `OrcScan` with the aggregation and the aggregation data schema.
* `OrcScan`: `createReaderFactory` creates an ORC reader factory with the aggregation and schema, similar to the change in `ParquetScan`.
* `OrcPartitionReaderFactory`: `buildReaderWithAggregates` creates an ORC reader with aggregate push down (i.e. it reads the ORC file footer to process column statistics instead of reading the actual data in the file). `buildColumnarReaderWithAggregates` creates a columnar ORC reader similarly. Both delegate the real footer-reading work to `OrcUtils.createAggInternalRowFromFooter`.
* `OrcUtils.createAggInternalRowFromFooter`: reads the ORC file footer to process column statistics (the real heavy lifting happens here), similar to `ParquetUtils.createAggInternalRowFromFooter`. It leverages utility methods such as `OrcFooterReader.readStatistics`.
* `OrcFooterReader`: `readStatistics` reads the ORC `ColumnStatistics[]` into Spark's `OrcColumnStatistics`. The transformation is needed because ORC `ColumnStatistics[]` stores all column statistics in a flattened array, which is hard to process, while Spark's `OrcColumnStatistics` stores the statistics in a nested tree structure (like `StructType`). This is used by `OrcUtils.createAggInternalRowFromFooter`; a simplified sketch of the flat-to-tree conversion follows after this list.
* `OrcColumnStatistics`: the easy-to-manipulate structure for ORC `ColumnStatistics`. This is used by `OrcFooterReader.readStatistics`.
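A simplified, hypothetical model of that flat-to-tree conversion (not the actual `OrcFooterReader` code): ORC reports statistics as a flat array indexed in schema pre-order (index 0 is the root struct), and we rebuild a tree that mirrors the schema so individual columns are easy to look up.

```scala
sealed trait SchemaNode { def children: Seq[SchemaNode] }
case object Leaf extends SchemaNode { val children: Seq[SchemaNode] = Seq.empty }
final case class Struct(children: Seq[SchemaNode]) extends SchemaNode

// Each node remembers which slot of the flat statistics array it corresponds to.
final case class StatsNode(flatIndex: Int, children: Seq[StatsNode])

// Returns the node rooted at `start` plus the next unused flat index.
def toTree(schema: SchemaNode, start: Int = 0): (StatsNode, Int) = {
  var next = start + 1
  val childNodes = schema.children.map { child =>
    val (node, consumed) = toTree(child, next)
    next = consumed
    node
  }
  (StatsNode(start, childNodes), next)
}

// A root struct with two leaf columns occupies flat indices 0, 1 and 2.
val (tree, total) = toTree(Struct(Seq(Leaf, Leaf)))
assert(total == 3 && tree.children.map(_.flatIndex) == Seq(1, 2))
```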

### Why are the changes needed?

To improve the performance of query with aggregate.

### Does this PR introduce _any_ user-facing change?

Yes. A user-facing config `spark.sql.orc.aggregatePushdown` is added to control enabling/disabling the aggregate push down for ORC. By default the feature is disabled.

### How was this patch tested?

Added unit test in `FileSourceAggregatePushDownSuite.scala`. Refactored all unit tests in https://github.com/apache/spark/pull/33639, and it now works for both Parquet and ORC.

Closes #34298 from c21/orc-agg.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>

* [SPARK-37960][SQL] A new framework to represent catalyst expressions in DS v2 APIs

### What changes were proposed in this pull request?
This PR provides a new framework to represent catalyst expressions in DS v2 APIs.
`GeneralSQLExpression` is a general SQL expression used to represent catalyst expressions in the DS v2 API.
`ExpressionSQLBuilder` is a builder that generates `GeneralSQLExpression` from catalyst expressions.
`CASE ... WHEN ... ELSE ... END` is just the first use case.

This PR also supports aggregate push down with `CASE ... WHEN ... ELSE ... END`.
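A hypothetical, self-contained sketch of the idea (illustrative names, not the real `GeneralSQLExpression` / `ExpressionSQLBuilder` classes): turn a tiny expression tree into a SQL string, with `CASE ... WHEN ... ELSE ... END` as the first supported shape.

```scala
sealed trait Expr
final case class Col(name: String) extends Expr
final case class Lit(value: Int) extends Expr
final case class GreaterThan(left: Expr, right: Expr) extends Expr
final case class CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Expr) extends Expr

def toSQL(e: Expr): String = e match {
  case Col(n)            => n
  case Lit(v)            => v.toString
  case GreaterThan(l, r) => s"(${toSQL(l)} > ${toSQL(r)})"
  case CaseWhen(bs, el)  =>
    val whens = bs.map { case (c, v) => s"WHEN ${toSQL(c)} THEN ${toSQL(v)}" }
    s"CASE ${whens.mkString(" ")} ELSE ${toSQL(el)} END"
}

// Produces: CASE WHEN (salary > 10000) THEN 1 ELSE 0 END
toSQL(CaseWhen(Seq(GreaterThan(Col("salary"), Lit(10000)) -> Lit(1)), Lit(0)))
```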

### Why are the changes needed?
Support aggregate push down with `CASE ... WHEN ... ELSE ... END`.

### Does this PR introduce _any_ user-facing change?
Yes. Users could use `CASE ... WHEN ... ELSE ... END` with aggregate push down.

### How was this patch tested?
New tests.

Closes #35248 from beliefer/SPARK-37960.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37867][SQL][FOLLOWUP] Compile aggregate functions for build-in DB2 dialect

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/35166.
The previously referenced DB2 documentation was incorrect, so some aggregate functions were not compiled.

The correct documentation is https://www.ibm.com/docs/en/db2/11.5?topic=af-regression-functions-regr-avgx-regr-avgy-regr-count

### Why are the changes needed?
Make the built-in DB2 dialect support complete aggregate push-down for more aggregate functions.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can use complete aggregate push-down with the built-in DB2 dialect.

### How was this patch tested?
New tests.

Closes #35520 from beliefer/SPARK-37867_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36568][SQL] Better FileScan statistics estimation

### What changes were proposed in this pull request?
This PR modifies `FileScan.estimateStatistics()` to take the read schema into account.

### Why are the changes needed?
`V2ScanRelationPushDown` can column prune `DataSourceV2ScanRelation`s and change read schema of `Scan` operations. The better statistics returned by `FileScan.estimateStatistics()` can mean better query plans. For example, with this change the broadcast issue in SPARK-36568 can be avoided.
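Illustrative arithmetic only (not Spark's exact formula): if column pruning keeps only a fraction of the row width, the estimated scan size should shrink accordingly, which can keep a join under the broadcast threshold.

```scala
val totalFileSizeInBytes = 10L * 1024 * 1024 * 1024 // 10 GiB of files on disk
val fullRowWidthInBytes  = 200                      // estimated width of a full row
val readRowWidthInBytes  = 20                       // width after column pruning
val estimatedScanBytes =
  totalFileSizeInBytes * readRowWidthInBytes / fullRowWidthInBytes // ~1 GiB instead of 10 GiB
```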

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #33825 from peter-toth/SPARK-36568-scan-statistics-estimation.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37929][SQL] Support cascade mode for `dropNamespace` API

### What changes were proposed in this pull request?
This PR adds a new API `dropNamespace(String[] ns, boolean cascade)` to replace the existing one: it adds a boolean parameter `cascade` that supports deleting all the namespaces and tables under the namespace.

It also changes the implementations and tests that are relevant to this API.
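A hedged, self-contained sketch of the cascade semantics, using a hypothetical in-memory catalog rather than the real `SupportsNamespaces` implementation: with `cascade = false` the drop fails on a non-empty namespace, while with `cascade = true` the namespace is removed together with everything under it.

```scala
import scala.collection.mutable

final class TinyCatalog {
  // namespace path -> table names registered directly under it
  private val tables = mutable.Map.empty[Seq[String], mutable.Set[String]]

  def createNamespace(ns: Seq[String]): Unit =
    tables.getOrElseUpdate(ns, mutable.Set.empty)

  def createTable(ns: Seq[String], name: String): Unit =
    tables.getOrElseUpdate(ns, mutable.Set.empty) += name

  def dropNamespace(ns: Seq[String], cascade: Boolean): Boolean = {
    // Everything at or below `ns`, including nested namespaces.
    val affected = tables.keys.filter(_.startsWith(ns)).toSeq
    val nonEmpty = affected.size > 1 || affected.exists(tables(_).nonEmpty)
    if (nonEmpty && !cascade) {
      throw new IllegalStateException(s"Namespace ${ns.mkString(".")} is not empty")
    }
    affected.foreach(tables.remove)
    affected.nonEmpty
  }
}
```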

### Why are the changes needed?
According to [#cmt](https://github.com/apache/spark/pull/35202#discussion_r784463563), the current `dropNamespace` API doesn't support cascade mode, so this PR replaces it to support cascading.
If `cascade` is set to true, all namespaces and tables under the namespace are deleted.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test.

Closes #35246 from dchvn/change_dropnamespace_api.

Authored-by: dch nguyen <dchvn.dgd@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* code format

* [SPARK-38196][SQL] Refactor framework so as JDBC dialect could compile expression by self way

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35248 provides a new framework to represent catalyst expressions in DS V2 APIs.
Because that framework translates all catalyst expressions into a unified SQL string and cannot stay compatible across different JDBC databases, it does not work well.

This PR refactors the framework so that each JDBC dialect can compile expressions in its own way.
First, the framework translates catalyst expressions into DS V2 expressions.
Second, the JDBC dialect compiles DS V2 expressions into its own SQL syntax.

The Javadoc is shown below:
![image](https://user-images.githubusercontent.com/8486025/156579584-f56cafb5-641f-4c5b-a06e-38f4369051c3.png)
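A hypothetical sketch of the two-step design (illustrative names, not the real `V2ExpressionSQLBuilder` or `JdbcDialect` APIs): the framework first builds a source-agnostic V2 expression, and each JDBC dialect then compiles that expression into its own SQL text, for example with its own identifier quoting.

```scala
sealed trait V2Expression
final case class FieldRef(name: String) extends V2Expression
final case class Literal(value: String) extends V2Expression
final case class Equals(left: V2Expression, right: V2Expression) extends V2Expression

trait DialectSQLBuilder {
  def quote(identifier: String): String
  def build(e: V2Expression): String = e match {
    case FieldRef(n)  => quote(n)
    case Literal(v)   => s"'$v'"
    case Equals(l, r) => s"${build(l)} = ${build(r)}"
  }
}

object MySQLLikeBuilder extends DialectSQLBuilder {
  def quote(identifier: String): String = s"`$identifier`"
}

object PostgresLikeBuilder extends DialectSQLBuilder {
  def quote(identifier: String): String = "\"" + identifier + "\""
}

// The same V2 expression compiles to `name` = 'bob' vs "name" = 'bob'.
val expr = Equals(FieldRef("name"), Literal("bob"))
val mysqlSQL = MySQLLikeBuilder.build(expr)
val pgSQL = PostgresLikeBuilder.build(expr)
```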

### Why are the changes needed?
Make the framework more generally applicable.

### Does this PR introduce _any_ user-facing change?
'No'.
The feature is not released.

### How was this patch tested?
Existing tests.

Closes #35494 from beliefer/SPARK-37960_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38361][SQL] Add factory method `getConnection` into `JDBCDialect`

### What changes were proposed in this pull request?
At present, the factory method for obtaining a JDBC connection takes no parameters, because the JDBC URL of some databases is fixed and unique.
However, for databases such as ClickHouse, the connection depends on the shard node.
So I think the parameter form `getConnection: Partition => Connection` is more general.

This PR adds a factory method `getConnection` to `JDBCDialect`, following https://github.com/apache/spark/pull/35696#issuecomment-1058060107.
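A hypothetical sketch of the idea (illustrative types, not the real `JDBCDialect` API): the connection factory takes the partition being read, so a shard-aware source (ClickHouse-style) can connect to the node that owns that partition, while a single-endpoint source ignores it.

```scala
import java.sql.{Connection, DriverManager}

// Hypothetical partition description carrying an optional shard-specific URL.
final case class JdbcPartition(shardUrl: Option[String])

def connectionFactory(defaultUrl: String): JdbcPartition => Connection =
  partition => DriverManager.getConnection(partition.shardUrl.getOrElse(defaultUrl))

// Usage: each partition reader requests its own connection.
// connectionFactory("jdbc:clickhouse://gateway:8123")(JdbcPartition(Some("jdbc:clickhouse://shard-1:8123")))
```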

### Why are the changes needed?
Make factory method `getConnection` more general.

### Does this PR introduce _any_ user-facing change?
'No'.
Just an internal change.

### How was this patch tested?
Existing tests.

Closes #35727 from beliefer/SPARK-38361_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* code format

* [SPARK-38560][SQL] If `Sum`, `Count`, `Any` accompany with distinct, cannot do partial agg push down

### What changes were proposed in this pull request?
Spark could partially push down sum(distinct col) and count(distinct col) when the data source has multiple partitions, and Spark would then sum the partial values again.
So the result may not be correct (see the illustration below).
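A tiny plain-Scala illustration of why partially pushing down COUNT(DISTINCT col) is unsafe: re-aggregating per-partition distinct counts double-counts values that appear in more than one partition.

```scala
val partition1 = Seq(1, 2)
val partition2 = Seq(2, 3)

val partialThenSummed = partition1.distinct.size + partition2.distinct.size // 4
val trueDistinctCount = (partition1 ++ partition2).distinct.size            // 3
assert(partialThenSummed != trueDistinctCount)
```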

### Why are the changes needed?
Fix the bug where pushing down sum(distinct col) or count(distinct col) to the data source returns an incorrect result.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the correct behavior.

### How was this patch tested?
New tests.

Closes #35873 from beliefer/SPARK-38560.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-36718][SQL] Only collapse projects if we don't duplicate expensive expressions

### What changes were proposed in this pull request?

The `CollapseProject` rule can combine adjacent projects and merge the project lists. The key idea behind this rule is that a Project operator is relatively expensive, while expression evaluation is cheap, so the expression duplication caused by this rule is not a problem. This last assumption is, unfortunately, not always true:
- A user can invoke some expensive UDF, this now gets invoked more often than originally intended.
- A projection is very cheap in whole stage code generation. The duplication caused by `CollapseProject` does more harm than good here.

This PR addresses this problem, by only collapsing projects when it does not duplicate expensive expressions. In practice this means an input reference may only be consumed once, or when its evaluation does not incur significant overhead (currently attributes, nested column access, aliases & literals fall in this category).
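A hedged illustration of the scenario with a hypothetical costly UDF: `a` is consumed twice in the outer select, so naively collapsing the two projects would duplicate the UDF call, which is exactly the case the rule now avoids.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val expensiveUdf = udf((i: Long) => { Thread.sleep(1); i }) // stand-in for a costly UDF

val df = spark.range(10)
  .select(expensiveUdf(col("id")).as("a"))
  .select((col("a") + 1).as("x"), (col("a") + 2).as("y"))
```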

### Why are the changes needed?

We have seen multiple complaints about `CollapseProject` in the past, because it may duplicate expensive expressions. The most recent one is https://github.com/apache/spark/pull/33903 .

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new UT and existing test

Closes #33958 from cloud-fan/collapse.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL] Refactor framework so as JDBC dialect could compile filter by self way

### What changes were proposed in this pull request?
Currently, Spark DS V2 can push down filters into a JDBC source. However, only the most basic form of filter is supported.
On the other hand, some JDBC sources cannot compile the filters in their own way.

This PR refactors the framework so that each JDBC dialect can compile filters in its own way.
First, the framework translates catalyst expressions into DS V2 filters.
Second, the JDBC dialect compiles DS V2 filters into its own SQL syntax.

### Why are the changes needed?
Make the framework more generally applicable.

### Does this PR introduce _any_ user-facing change?
'No'.
The feature is not released.

### How was this patch tested?
Existing tests.

Closes #35768 from beliefer/SPARK-38432_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL][FOLLOWUP] Supplement test case for overflow and add comments

### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/35768 and improves the code.

1. Supplement test case for overflow
2. Do not throw IllegalArgumentException
3. Improve V2ExpressionSQLBuilder
4. Add comments in V2ExpressionBuilder

### Why are the changes needed?
Supplement test case for overflow and add comments.

### Does this PR introduce _any_ user-facing change?
'No'.
V2 aggregate pushdown not released yet.

### How was this patch tested?
New tests.

Closes #35933 from beliefer/SPARK-38432_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38533][SQL] DS V2 aggregate push-down supports project with alias

### What changes were proposed in this pull request?
Currently, Spark DS V2 aggregate push-down doesn't support project with alias.

Refer https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala#L96

This PR makes it work well with aliases.

**The first example:**
the original plan is shown below:
```
Aggregate [DEPT#0], [DEPT#0, sum(mySalary#8) AS total#14]
+- Project [DEPT#0, SALARY#2 AS mySalary#8]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession77978658,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions5f8da82)
```
If we can completely push down the aggregate, then the plan will be:
```
Project [DEPT#0, SUM(SALARY)#18 AS sum(SALARY#2)#13 AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```
If we can partially push down the aggregate, then the plan will be:
```
Aggregate [DEPT#0], [DEPT#0, sum(cast(SUM(SALARY)#18 as decimal(20,2))) AS total#14]
+- RelationV2[DEPT#0, SUM(SALARY)#18] test.employee
```

**The second example:**
the original plan is shown below:
```
Aggregate [myDept#33], [myDept#33, sum(mySalary#34) AS total#40]
+- Project [DEPT#25 AS myDept#33, SALARY#27 AS mySalary#34]
   +- ScanBuilderHolder [DEPT#25, NAME#26, SALARY#27, BONUS#28], RelationV2[DEPT#25, NAME#26, SALARY#27, BONUS#28] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession25c4f621,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions345d641e)
```
If we can completely push down the aggregate, then the plan will be:
```
Project [DEPT#25 AS myDept#33, SUM(SALARY)#44 AS sum(SALARY#27)#39 AS total#40]
+- RelationV2[DEPT#25, SUM(SALARY)#44] test.employee
```
If we can partially push down the aggregate, then the plan will be:
```
Aggregate [myDept#33], [DEPT#25 AS myDept#33, sum(cast(SUM(SALARY)#56 as decimal(20,2))) AS total#52]
+- RelationV2[DEPT#25, SUM(SALARY)#56] test.employee
```

### Why are the changes needed?
Supporting aliases makes the push-down more broadly applicable.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can see that DS V2 aggregate push-down supports project with alias.

### How was this patch tested?
New tests.

Closes #35932 from beliefer/SPARK-38533_new.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* code format

* [SPARK-37483][SQL][FOLLOWUP] Rename `pushedTopN` to `PushedTopN` and improve JDBCV2Suite

### What changes were proposed in this pull request?
This PR fixes three issues.
**First**, create method `checkPushedInfo` and `checkSortRemoved` to reuse code.
**Second**, remove method `checkPushedLimit`, because `checkPushedInfo` can cover it.
**Third**, rename `pushedTopN` to `PushedTopN` so it is consistent with other pushed information.

### Why are the changes needed?
Reuse code and report the pushed information more accurately.

### Does this PR introduce _any_ user-facing change?
'No'. New feature and test improvements.

### How was this patch tested?
Adjust existing tests.

Closes #35921 from beliefer/SPARK-37483_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38644][SQL] DS V2 topN push-down supports project with alias

### What changes were proposed in this pull request?
Currently, Spark DS V2 topN push-down doesn't support project with alias.

This PR makes it work well with aliases.

**Example**:
the original plan is shown below:
```
Sort [mySalary#10 ASC NULLS FIRST], true
+- Project [NAME#1, SALARY#2 AS mySalary#10]
   +- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession7fd4b9ec,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true),StructField(IS_MANAGER,BooleanType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions3c8e4a82)
```
The `pushedLimit` and `sortOrders` of `JDBCScanBuilder` are empty.

If we can push down the top n, then the plan will be:
```
Project [NAME#1, SALARY#2 AS mySalary#10]
+- ScanBuilderHolder [DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4], RelationV2[DEPT#0, NAME#1, SALARY#2, BONUS#3, IS_MANAGER#4] test.employee, JDBCScanBuilder(org.apache.spark.sql.test.TestSparkSession7fd4b9ec,StructType(StructField(DEPT,IntegerType,true),StructField(NAME,StringType,true),StructField(SALARY,DecimalType(20,2),true),StructField(BONUS,DoubleType,true),StructField(IS_MANAGER,BooleanType,true)),org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions3c8e4a82)
```
The `pushedLimit` of `JDBCScanBuilder` will be `1` and `sortOrders` of `JDBCScanBuilder` will be `SALARY ASC NULLS FIRST`.

### Why are the changes needed?
Supporting aliases makes the push-down more broadly applicable.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can see that DS V2 topN push-down supports project with alias.

### How was this patch tested?
New tests.

Closes #35961 from beliefer/SPARK-38644.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38391][SQL] Datasource v2 supports partial topN push-down

### What changes were proposed in this pull request?
Currently, Spark only supports pushing down topN completely. But for some data sources (e.g. JDBC) that have multiple partitions, we should support partial topN push-down (see the illustration below).
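A plain-Scala illustration of partial top-N push-down: each partition can only return its own top n rows, so Spark must still merge them and re-apply the final sort and limit.

```scala
val partition1 = Seq(5, 1, 9)
val partition2 = Seq(3, 8, 2)
val n = 2

val partialTopN = partition1.sorted.take(n) ++ partition2.sorted.take(n) // pushed to each partition
val finalTopN   = partialTopN.sorted.take(n)                             // Spark's final sort + limit
assert(finalTopN == Seq(1, 2))
```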

### Why are the changes needed?
Make the behavior of sort push-down correct.

### Does this PR introduce _any_ user-facing change?
'No'. Just changes the internal implementation.

### How was this patch tested?
New tests.

Closes #35710 from beliefer/SPARK-38391.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38633][SQL] Support push down Cast to JDBC data source V2

### What changes were proposed in this pull request?
Cast is very useful, and Spark frequently inserts Cast to convert data types automatically.

### Why are the changes needed?
Allow more aggregates and filters to be pushed down.

### Does this PR introduce _any_ user-facing change?
'Yes'.
This PR lands after the 3.3.0 branch cut.

### How was this patch tested?
New tests.

Closes #35947 from beliefer/SPARK-38633.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38432][SQL][FOLLOWUP] Add test case for push down filter with alias

### What changes were proposed in this pull request?
DS V2 predicate push-down to the data source supports columns with aliases,
but Spark was missing a test case for pushing down a filter with an alias.

### Why are the changes needed?
Add a test case for pushing down a filter with an alias.

### Does this PR introduce _any_ user-facing change?
'No'.
Just add a test case.

### How was this patch tested?
New tests.

Closes #35988 from beliefer/SPARK-38432_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38633][SQL][FOLLOWUP] JDBCSQLBuilder should build cast to type of databases

### What changes were proposed in this pull request?
DS V2 supports pushing CAST down to the database.
The current implementation only uses the typeName of the DataType.
For example, `Cast(column, StringType)` is built as `CAST(column AS String)`.
But it should be `CAST(column AS TEXT)` for Postgres or `CAST(column AS VARCHAR2(255))` for Oracle; a hedged sketch of this dialect-aware mapping follows below.
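A hedged sketch of dialect-aware CAST compilation (hypothetical names, not the real `JDBCSQLBuilder`): the dialect maps Spark's data type to its own type name before the CAST text is emitted.

```scala
import org.apache.spark.sql.types.{DataType, StringType}

trait DialectCast {
  def typeName(dt: DataType): String
  def compileCast(columnSQL: String, dt: DataType): String =
    s"CAST($columnSQL AS ${typeName(dt)})"
}

object PostgresLikeCast extends DialectCast {
  def typeName(dt: DataType): String = dt match {
    case StringType => "TEXT"
    case other      => other.sql // Spark's generic SQL type name as a fallback
  }
}

object OracleLikeCast extends DialectCast {
  def typeName(dt: DataType): String = dt match {
    case StringType => "VARCHAR2(255)"
    case other      => other.sql
  }
}

// CAST(name AS TEXT) for the Postgres-like dialect,
// CAST(name AS VARCHAR2(255)) for the Oracle-like one.
PostgresLikeCast.compileCast("name", StringType)
OracleLikeCast.compileCast("name", StringType)
```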

### Why are the changes needed?
Improve the implementation of CAST push-down.

### Does this PR introduce _any_ user-facing change?
'No'.
Just new feature.

### How was this patch tested?
Existing tests

Closes #35999 from beliefer/SPARK-38633_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37839][SQL][FOLLOWUP] Check overflow when DS V2 partial aggregate push-down `AVG`

### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/35130 supports partial aggregate push-down `AVG` for DS V2.
The behavior is not consistent with `Average` when overflow occurs in ANSI mode.
This PR closely follows the implementation of `Average` to respect overflow in ANSI mode.

### Why are the changes needed?
Make the behavior consistent with `Average` when overflow occurs in ANSI mode.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users can see the overflow exception thrown in ANSI mode.

### How was this patch tested?
New tests.

Closes #35320 from beliefer/SPARK-37839_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-37960][SQL][FOLLOWUP] Make the testing CASE WHEN query more reasonable

### What changes were proposed in this pull request?
Some testing CASE WHEN queries are not carefully written and do not make sense. In the future, the optimizer may get smarter and get rid of the CASE WHEN completely, and then we lose test coverage.

This PR updates some CASE WHEN queries to make them more reasonable.

### Why are the changes needed?
future-proof test coverage.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

Closes #36032 from beliefer/SPARK-37960_followup2.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38761][SQL] DS V2 supports push down misc non-aggregate functions

### What changes were proposed in this pull request?
Currently, Spark has some misc non-aggregate functions from the ANSI standard. Please refer to https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L362.
These functions show below:
`abs`,
`coalesce`,
`nullif`,
`CASE WHEN`
DS V2 should support pushing down these misc non-aggregate functions.
Because DS V2 already supports pushing down `CASE WHEN`, this PR doesn't need to handle it again.
Because `nullif` extends `RuntimeReplaceable`, this PR doesn't need to handle it either.

### Why are the changes needed?
DS V2 supports pushing down misc non-aggregate functions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes #36039 from beliefer/SPARK-38761.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* [SPARK-38865][SQL][DOCS] Update document of JDBC options for `pushDownAggregate` and `pushDownLimit`

### What changes were proposed in this pull request?
Because the DS v2 pushdown framework was refactored, we need to add more documentation in `sql-data-sources-jdbc.md` to reflect the new changes.

### Why are the changes needed?
Add doc for new changes for `pushDownAggregate` and `pushDownLimit`.

### Does this PR introduce _any_ user-facing change?
'No'. Updated for new feature.

### How was this patch tested?
N/A

Closes #36152 from beliefer/SPARK-38865.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: huaxingao <huaxin_gao@apple.com>

* [SPARK-38855][SQL] DS V2 supports push down math functions

### What changes were proposed in this pull request?
Currently, Spark has some math functions from the ANSI standard. Please refer to https://github.com/apache/spark/blob/2f8613f22c0750c00cf1dcfb2f31c431d8dc1be7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L388
These functions show below:
`LN`,
`EXP`,
`POWER`,
`SQRT`,
`FLOOR`,
`CEIL`,
`WIDTH_BUCKET`

The mainstream databases support these functions show below.

|  Function   | PostgreSQL  | ClickHouse  | H2  | MySQL  | Oracle  | Redshift  | Presto  | Teradata  | Snowflake  | DB2  | Vertica  | Exasol  | SqlServer  | Yellowbrick  | Impala  | Mariadb | Druid | Pig | SQLite | Influxdata | Singlestore | ElasticSearch |
|  ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  | ----  |
| `LN` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `EXP` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `POWER` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes |
| `SQRT` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `FLOOR` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `CEIL` | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `WIDTH_BUCKET` | Yes | No | No | No | Yes | No | Yes | Yes | Yes | Yes | Yes | No | No | No | Yes | No | No | No | No | No | No | No |

DS V2 should support pushing down these math functions.

### Why are the changes needed?
DS V2 supports pushing down math functions.

### Does this PR introduce _any_ user-facing change?
'No'.
New feature.

### How was this patch tested?
New tests.

Closes #36140 from beliefer/SPARK-38855.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

* update spark version to r61

Co-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: dch nguyen <dgd_contributor@viettel.com.vn>
Co-authored-by: Cheng Su <chengsu@fb.com>
Co-authored-by: Peter Toth <peter.toth@gmail.com>
Co-authored-by: dch nguyen <dchvn.dgd@gmail.com>
@huleilei
Contributor

Hello all, I want to know what indexes exist on a table, but the SHOW INDEX syntax is not supported. So I think SHOW INDEX syntax should be added to Spark. Please let me know if you have any thoughts.
In addition, should index rebuilding be supported? Different data sources have different requirements: MySQL does not need to rebuild index data, but it is supported in PostgreSQL, and data sources such as Delta and Iceberg also need to rebuild indexes. So I suggest providing interfaces to support updating index data. Thanks.

e.g. PostgreSQL:
https://www.postgresql.org/docs/13/sql-reindex.html
REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
